ensure iopub subscriptions propagate prior to accepting websocket connections #5908

Merged: 7 commits, Dec 18, 2020

@SylvainCorlay SylvainCorlay commented Dec 11, 2020

Nudge kernel with info request upon opening websocket until some IOPub message is received.

  • repeat info requests every 500ms until iopub is received
  • busy kernels are not nudged, since nudging them would block connections to busy kernels indefinitely
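
A rough sketch of the retry loop described above, written with plain asyncio rather than the tornado machinery the notebook handler uses. `FakeKernel`, `send_info_request`, and `nudge` are hypothetical names invented for this illustration; only the 500 ms interval comes from the description.

```python
import asyncio


class FakeKernel:
    """Hypothetical stand-in for a kernel whose IOPub subscription is slow to propagate."""

    def __init__(self, requests_until_subscribed):
        self.requests_until_subscribed = requests_until_subscribed
        self.info_requests_seen = 0
        self.iopub_received = asyncio.Event()

    def send_info_request(self):
        # Each nudge is an info request. Once the subscription has propagated,
        # the resulting IOPub traffic becomes visible to the client.
        self.info_requests_seen += 1
        if self.info_requests_seen >= self.requests_until_subscribed:
            self.iopub_received.set()


async def nudge(kernel, interval=0.5, max_attempts=20):
    """Resend info requests every `interval` seconds until an IOPub message arrives.

    Returns the number of requests sent; raises TimeoutError if none succeeds.
    """
    for attempt in range(1, max_attempts + 1):
        kernel.send_info_request()
        try:
            await asyncio.wait_for(kernel.iopub_received.wait(), timeout=interval)
            return attempt
        except asyncio.TimeoutError:
            continue  # no IOPub traffic yet, so nudge again
    raise TimeoutError("no IOPub message received")
```

The `max_attempts` cap is an assumption added so the sketch always terminates; the description above only says the requests repeat until IOPub is received.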

cf jupyter/jupyter_client#593

@jasongrout
Member

I thought that there was one set of zmq channels (shell, iopub, etc.) shared across all websocket connections (but I haven't looked at the code carefully recently). Is that wrong, or is that changing in this PR?

@SylvainCorlay SylvainCorlay force-pushed the nudge-kernel branch 2 times, most recently from eb69745 to caa1067 Compare December 11, 2020 21:01
@SylvainCorlay
Member Author

> I thought that there was one set of zmq channels (shell, iopub, etc.) shared across all websocket connections (but I haven't looked at the code carefully recently). Is that wrong, or is that changing in this PR?

I think that it is a correct statement.

@SylvainCorlay SylvainCorlay force-pushed the nudge-kernel branch 4 times, most recently from b7c088d to 430247d Compare December 12, 2020 09:13
@minrk
Member

minrk commented Dec 14, 2020

> I thought that there was one set of zmq channels

Each websocket client gets its own zmq connection to the kernel, created here using KernelManager.connect_iopub(), etc.

This is what allows the notebook server to not need client-identifying logic, since reply-routing on e.g. stdin, stream channels is handled in the kernel. IOPub messages are indeed wastefully duplicated, however.
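
As a toy model of that arrangement (not the real jupyter_client classes): every websocket client asks for its own IOPub socket, so a single published message is delivered once per client. `FakeKernelManager` and its `publish` method are hypothetical; only the method name `connect_iopub` mirrors the call mentioned above.

```python
class FakeKernelManager:
    """Toy stand-in for a kernel manager; not the jupyter_client class."""

    def __init__(self):
        self._iopub_sockets = []

    def connect_iopub(self):
        # Every call hands back a fresh "socket" (a plain list here).
        socket = []
        self._iopub_sockets.append(socket)
        return socket

    def publish(self, msg):
        # The kernel broadcasts on IOPub, so each socket gets its own copy.
        for socket in self._iopub_sockets:
            socket.append(msg)


km = FakeKernelManager()
client_a = km.connect_iopub()  # one websocket client
client_b = km.connect_iopub()  # a second websocket client
km.publish({"msg_type": "stream", "text": "hello"})
```

After the `publish` call both `client_a` and `client_b` hold the same message, which is the duplication referred to above.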

@minrk
Member

minrk commented Dec 17, 2020

Having poked and prodded quite a bit now, I understand how both branches need a nudge in case of restart:

  • zmq sockets are not recreated when the kernel is restarted (they shouldn't need to be, as zmq does its own underlying auto-reconnect).
  • there is no similar nudge operation triggered on restart for 'existing' connections.
  • to prod connections, clients tend to recreate the websocket connection on restart (this is not strictly required, but it is standard practice and may be needed in some cases where the zmq ports change).
  • a 'reconnecting' websocket may not actually mean a new zmq socket (the 'restoring connection' branch is taken) if the same session id is re-used.

So it's possible for a 'new' websocket connection, as part of the client's restart steps, to re-use zmq sockets and thus never nudge the new kernel's sockets.

The downside of doing this nudge on every ws connect, even on established, already-nudged zmq sockets, is that websocket connections will not be accepted while the kernel is busy, even though they should be. Currently, it is valid to connect and start sending requests without waiting for iopub, but this can result in the issues folks are seeing. The most robust and efficient nudge would be on:

  1. new zmq connections, and
  2. kernel restarts for all existing sockets

where requests on existing, open sockets are blocked until the nudge resolves, not just connection-opening. That's more complicated, and I think we can go with this solution as a first pass and refine it later, especially if the delay is bothering folks.
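
To make the "requests are blocked until the nudge resolves" idea concrete, here is a minimal asyncio sketch. It is not what this PR implements; `GatedConnection`, `start_nudge`, and `send` are hypothetical names, and the nudge itself is passed in as an arbitrary coroutine function.

```python
import asyncio


class GatedConnection:
    """Hypothetical wrapper that holds client requests until a nudge has resolved."""

    def __init__(self, nudge):
        self._nudge = nudge            # coroutine function performing the nudge
        self._ready = asyncio.Event()  # set once IOPub is known to be flowing
        self.forwarded = []            # requests passed on to the kernel

    async def start_nudge(self):
        # Would run for a new zmq connection and again after a kernel restart.
        self._ready.clear()
        await self._nudge()
        self._ready.set()

    async def send(self, request):
        await self._ready.wait()       # block until the nudge resolves
        self.forwarded.append(request)
```

A request sent before `start_nudge()` completes simply waits. Calling `start_nudge()` again on restart re-closes the gate for requests that arrive afterwards.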

The last thing I want to look at before merging is what happens if the nudge is still outstanding when the connections are closed; after that, I think this is okay to merge. Thanks @SylvainCorlay!

@minrk
Member

minrk commented Dec 17, 2020

I noticed that there are test failures caused by a change in the status messages delivered. The nudge waits for any iopub message and then resolves, which means it will tend to resolve on the 'busy' status message, and the 'idle' status message will end up propagating to the client after the connection is resolved. It may be better to wait specifically for the idle message associated with the info request to preserve message expectations, even though this is not strictly required.
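
A small sketch of what "the idle message associated with the info request" could look like as a check, assuming messages are dicts laid out as in the Jupyter messaging protocol (`header`, `parent_header`, `content`). The function name and the sample messages are made up for illustration.

```python
def is_idle_reply_to(msg, request_id):
    """Return True if `msg` is the IOPub 'idle' status caused by the given request."""
    return (
        msg.get("header", {}).get("msg_type") == "status"
        and msg.get("content", {}).get("execution_state") == "idle"
        and msg.get("parent_header", {}).get("msg_id") == request_id
    )


# Illustrative messages for an info request whose msg_id is "abc".
busy = {
    "header": {"msg_type": "status"},
    "parent_header": {"msg_id": "abc"},
    "content": {"execution_state": "busy"},
}
idle = {
    "header": {"msg_type": "status"},
    "parent_header": {"msg_id": "abc"},
    "content": {"execution_state": "idle"},
}
```

Resolving on any IOPub message would stop at `busy`; resolving only when `is_idle_reply_to(msg, "abc")` holds would also consume `idle` before the connection is handed to the client.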

@SylvainCorlay
Member Author

@minrk I think this is ready to go.

When this gets in I will update the jupyter_server PR.

- connect iopub first (tiny effect on the race!)
- docstrings, log details
- resolve immediately if kernel is busy, rather than setting up timeouts, futures
- use gen.with_timeout instead of separately managed timeout
- use gen.multi to wait for both futures instead of duplicated check in each handler, third Future
- add various cancel conditions (sockets closed, kernel stopped, etc.)
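
For readers unfamiliar with the pattern in the list above, here is an approximate asyncio analogue of waiting on two futures with a single timeout. It uses `asyncio.gather` and `asyncio.wait_for` in place of tornado's `gen.multi` and `gen.with_timeout`, so it is a stand-in for the idea, not the handler's code; `wait_for_nudge` is a hypothetical name.

```python
import asyncio


async def wait_for_nudge(info_future, iopub_future, timeout):
    """Wait for both futures, giving up after `timeout` seconds.

    Returns True if both resolved in time, False if the timeout expired.
    """
    both = asyncio.gather(info_future, iopub_future)
    try:
        await asyncio.wait_for(both, timeout)
        return True
    except asyncio.TimeoutError:
        return False
    finally:
        # Make sure nothing is left pending once we stop waiting.
        for fut in (info_future, iopub_future):
            if not fut.done():
                fut.cancel()
```

Other cancel conditions (socket closed, kernel stopped) could cancel the same two futures from outside; that part is omitted here.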
@minrk minrk changed the title from "Nudge kernel with info requests" to "ensure iopub subscriptions propagate prior to accepting websocket connections" Dec 18, 2020
@minrk minrk merged commit 5abcbd3 into jupyter:master Dec 18, 2020
@SylvainCorlay SylvainCorlay deleted the nudge-kernel branch December 18, 2020 11:42
@Zsailer
Member

Zsailer commented Dec 18, 2020

Thanks, @SylvainCorlay, @minrk, and @jasongrout!


which in turn triggers cleanup
"""
for f in (info_future, iopub_future):
Member

Just checking - the function argument is f and the loop variable is f. Is that going to cause a problem?
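
For context on the scoping question: a Python `for` loop does not open a new scope, so a loop variable with the same name as a parameter rebinds that parameter for the rest of the function. That is harmless only if the original argument is not needed afterwards. The function below is a hypothetical reduction, not the code under review.

```python
def cleanup(f=None):
    """Hypothetical callback whose parameter shares its name with the loop variable."""
    seen = []
    for f in ("info_future", "iopub_future"):  # rebinds the local name `f`
        seen.append(f)
    return seen, f  # `f` now refers to the last item, not the original argument
```

Calling `cleanup("original")` returns the two loop items and `"iopub_future"` as the final value of `f`; the original argument is no longer reachable through that name.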

@jasongrout
Member

@kevin-bates - what is the plan for a release with this PR merged? We're seeing consistent problems that this PR is designed to fix, and would love to have a release fixing the problem. I'm happy to help with the release process if that is helpful.

@Zsailer
Member

Zsailer commented Dec 22, 2020

I can work on a patch release, since @kevin-bates is on vacation this month.

@jasongrout
Member

Awesome, thanks @Zsailer! Again, I'm happy to help if needed.

@Zsailer
Member

Zsailer commented Dec 22, 2020

Oops, I thought I was a maintainer on PyPI for this repo, but it looks like I was mistaken. 🤦

@jasongrout
Member

I thought I was too. Looks like @blink1073 and @minrk are owners of the notebook package on PyPI, though - maybe they could make both of us maintainers too?

@blink1073
Contributor

I am not an owner, only a maintainer.

@Zsailer
Member

Zsailer commented Dec 23, 2020

Notebook 6.1.6 released with this PR included.

lresende pushed a commit to jupyter-server/enterprise_gateway that referenced this pull request Jan 9, 2021
Looks like the changes for jupyter/notebook#5908 introduced
an additional status response that pushed the actual 
kernel_info_reply out of the loop's range. 
Increasing it by one resolves the issue.
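
The failure mode in that commit message can be illustrated with a fixed-size read loop. Everything here is hypothetical (the function name, the message shape, and the exact order of messages); it only shows why one extra status message can push the reply past a fixed read limit.

```python
def find_kernel_info_reply(messages, max_reads):
    """Look for a kernel_info_reply within the first `max_reads` messages."""
    for msg in messages[:max_reads]:
        if msg["msg_type"] == "kernel_info_reply":
            return msg
    return None


# Hypothetical stream: one extra status message now precedes the reply.
before = [{"msg_type": "status"}, {"msg_type": "kernel_info_reply"}]
after = [{"msg_type": "status"}] + before
```

With `max_reads=2` the reply is found in `before` but missed in `after`; raising the limit to 3 finds it again, which mirrors the "increase by one" fix described above.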
@blink1073 blink1073 added this to the 6.2 milestone Mar 18, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 15, 2021